Project 3 is due by 7:00 p.m. EST on Tuesday, December 16, and just a reminder that no late pass is allowed for Project 3, as the deadline is the final assessment deadline set by the university. You must submit the project by the deadline. Please submit both a .pdf and a .Rmd file on Gradescope by the deadline.
Please do not email your work to any of the instructors in any case (unless you have no access to Gradescope at the time when the assignment is due–in this case please email the Head TA (see syllabus for contact info) both copies of your .pdf and .rmd files before the deadline and upload the unaltered copies of the files to Gradescope as soon as you regain access to Gradescope). We need your files to be on Gradescope in order to grade them. Thus, even if you are late, please still submit the files on Gradescope.
This assignment can be completed in groups of up to 3 students. Feel free to use Ed Discussion to recruit team members. It is okay to work by yourself, if this is preferred. You are welcome to get help (you can either ask questions on Ed Discussion or talk to the instructors during office hours) from the instructors; however, please do not post code/solutions on Ed Discussion on a public post. Please see the Academic Integrity Policy tab on Canvas about the course policy.
For problem sets and projects you may not work with a given student on more than two assignments. For example, if you work with student A on Problem set 1 and Project 1, then you cannot work with student A on any other problem sets and projects.
When working in a group it is your responsibility to make sure that you are satisfied with all parts of the report and the submission is on time (e.g., we will not entertain arguments that deficiencies are the responsibility of other group members). We expect that you each work independently first and then compare your answers with each other once you all finish, or help each other if someone in your group gets stuck. Failing to make contributions and then putting your name on a report will be considered a violation of the honor code. Please do not divide work among group mates. Everyone in your group is responsible for being able to explain the solutions in the report.
For all parts of this project, you MUST use R commands to print the output as part of your R Markdown file. You are not permitted to find the answer in the R console and then copy and paste the answer into your report. All code and outputs should be visible on your pdf file.
If you are completing this project in a group, please have only one person in your group turn in the .Rmd and .pdf files. Make sure that this person selects the names of all the group mates on Gradescope–if you are not sure about how to make a group submission please see the video posted under Assignments > Precept submission instructions on Canvas. To receive full credit for this assignment, you must select the pages properly for each problem on Gradescope; again, if you are not sure about how to do this, please watch the video mentioned.
Late problem sets and projects (i.e., being late even after you use a late pass) will be penalized at intervals rounded up to multiples of 24 hours. You will receive a 10% deduction on the score for each 24-hour late interval. For example, if you are 3 hours late, 10% off; if you are 30 hours late, 20% off. A late pass is not allowed for assignments that are due on Dean’s date.
Please type your name(s) after “Digitally signed:” below the honor pledge to serve as digital signature(s).
I pledge my honor that I have not violated the honor code when completing this assignment.
Digitally signed: Cindy Ruoheng Li
When reporting numerical findings in the narrative
part of your report, please round answers to an appropriate number
of digits. This means that usually having 2 decimal places (e.g., 45.71)
is good; however, if the answer has a lot of leading zeros (e.g.,
0.000034), you might want to include at least 2 non-zero digits. Just
make sure that you do not report answers in long strings of numbers
(e.g., 5.043289); this will make your report look
unprofessional. When printing out a data frame as part of an
answer, you can use print.data.frame(your_df, digits = 2);
this will allow you to print out the entire data frame
your_df and suggest to R that you want to have
at least 2 significant digits–you might want to adjust the
value for digits based on individual cases.
Remember not to round intermediate calculations and please avoid
hard-coding. For example, if we updated the dataset
loan_data.csv with additional rows, your code should still work
without any modifications.
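For example, the rounding advice above can be illustrated on a made-up data frame (the name toy_df and its values are hypothetical, not part of the project data):

```r
# A made-up data frame to illustrate rounded printing (toy_df is hypothetical)
toy_df <- data.frame(price = c(450000.1234, 322000.5678),
                     rate  = c(0.000034, 5.043289))
# digits = 2 suggests at least 2 significant digits for the displayed values
print.data.frame(toy_df, digits = 2)
```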
Unless otherwise specified, please provide code and the output of the code to support all your answers.
In general, if your report contains plots, in order to receive full credit, please have sensible titles and axis labels for all your graphs and adjust the values of all relevant graphical parameters so that your plots are informative. Also, all answers must be written in complete sentences.
(Hypothetical) Congrats! You have been hired as an intern at Redfin (Redfin.com) in Seattle. For your first project, your manager would like you to build a new model to predict house prices in the King County area of Washington State.
To view the predicted sold prices by Redfin’s current model, refer to the number displayed in the Redfin Estimate for … section. For instance, for this house (https://www.redfin.com/NJ/Princeton/17-Aiken-Ave-08540/home/36691957), as of Nov. 29, 2025, the Redfin estimate is $1,129,552 compared to the asking price of $1.145 million.
For this project we will use a dataset1 that contains transactional information for houses sold between May 2014 and May 2015 in King County of Washington, including Seattle.
The data for this project is stored in the file
house_data_cleaned.csv, and the definitions for the
variables in the dataset are listed below.
(See the variable definitions list; e.g., grade follows BLDGGRADE on https://www5.kingcounty.gov/sdc/FGDCDocs/RESBLDG_EXTR_faq.htm.)
Read the dataset house_data_cleaned.csv into R and name it house. house should have 21605 rows and 15 columns.
Convert the following variables to categorical variables:
waterfront, floors, condition,
and zipcode. (I have checked and there is not much linear
relationship between the numeric version of floors and
price, and between the numeric version of
condition and price. Therefore, it will be
better to treat these variables as categorical variables.)
Also, add an additional column renov_Y.N to
house. renov_Y.N should be a categorical
variable. An element in renov_Y.N is Y if the
corresponding element in yr_renovated is greater than zero,
and N otherwise. The function dplyr::if_else() might be
helpful here–you can check out its help manual–but you are not required
to use it.
Print the summary statistics for each column in the data frame
house (i.e., the five-number summary and mean for numeric
columns, and the frequency count for each level of categorical columns)
after completing all the modifications above. (Hint: This can be done in
a single line of code.)
Answer
house <- read.csv("house_data_cleaned.csv")
dim(house)
[1] 21605 15
#convert variables to categorical variables
house$waterfront <- as.factor(house$waterfront)
house$floors <- as.factor(house$floors)
house$condition <- as.factor(house$condition)
house$zipcode <- as.factor(house$zipcode)
# Add renov_Y.N column
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
house <- house |>
mutate(renov_Y.N = factor(yr_renovated > 0, labels = c("N", "Y")))
#summary of each column
house_summary <- summary(house)
print(house_summary)
date price bedrooms
Length:21605 Min. : 75000 Min. : 0.000
Class :character 1st Qu.: 322000 1st Qu.: 3.000
Mode :character Median : 450000 Median : 3.000
Mean : 540090 Mean : 3.371
3rd Qu.: 645000 3rd Qu.: 4.000
Max. :7700000 Max. :11.000
bathrooms sqft_living sqft_lot floors
Min. :0.000 Min. : 370 Min. : 520 1 :10677
1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040 1.5: 1910
Median :2.250 Median : 1910 Median : 7620 2 : 8238
Mean :2.115 Mean : 2080 Mean : 15109 2.5: 161
3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 3 : 612
Max. :8.000 Max. :13540 Max. :1651359 3.5: 7
waterfront condition grade sqft_above
0:21442 1: 29 Min. : 3.000 Min. : 370
1: 163 2: 172 1st Qu.: 7.000 1st Qu.:1190
3:14026 Median : 7.000 Median :1560
4: 5678 Mean : 7.657 Mean :1788
5: 1700 3rd Qu.: 8.000 3rd Qu.:2210
Max. :13.000 Max. :9410
sqft_basement yr_built yr_renovated
Min. : 0.0 Min. :1900 Min. : 0.00
1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.00
Median : 0.0 Median :1975 Median : 0.00
Mean : 291.6 Mean :1971 Mean : 84.43
3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.00
Max. :4820.0 Max. :2015 Max. :2015.00
zipcode renov_Y.N
98103 : 601 N:20691
98038 : 590 Y: 914
98115 : 583
98052 : 574
98117 : 553
98042 : 548
(Other):18156
Is the distribution of the variable price symmetric or
skewed? If it is skewed, does it have a long left or right tail? Please
provide 1 graph to support your statement.
Answer
library(ggplot2)
#calculate numerical values to measure skewness
price_summary <- summary(house$price)
print(price_summary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
75000 322000 450000 540090 645000 7700000
#The price appears right-skewed with a long right tail: the mean (540,090) is
#larger than the median (450,000), which suggests that the largest prices pull
#the mean upward
#Graph Justification
Graph_justification <- ggplot(house, aes(x = price)) +
geom_histogram(aes(y = after_stat(density)),
bins = 50,
fill = "steelblue",
alpha = 0.7,
color = "white") +
geom_density(color = "darkred", linewidth = 1) +
labs(title = "Distribution of House Prices",
x = "Price ($)",
y = "Density")
print(Graph_justification)
#The histogram and density curve confirm that price is right-skewed with a
#long right tail
To ensure the quality of our analysis, we will exclude houses with extreme sold prices, i.e., prices greater than or equal to $4,000,000. How many houses will we exclude, out of 21605 houses?
Name the dataset with house prices less than $4,000,000
majority. What are the dimensions of
majority?
Also, replace the values in the price column in
majority with the natural logarithm of the prices. The goal
of doing the logarithmic transformation is to make the data fit the
assumption of the model better when we build linear models later.
Answer
#count the number of houses to exclude
houses_to_exclude <- sum(house$price >= 4000000)
houses_to_exclude
[1] 12
#out of 21605 houses, 12 houses will be excluded
#create the majority dataset
library(dplyr)
majority <- house |>
filter(price < 4000000)
dim(majority)
[1] 21593 16
#majority has 21593 rows and 16 columns (the 16th column is renov_Y.N)
#Replace price column with the natural logarithm of the prices
majority$price <- log(majority$price)
#Graphic result of the log conversion of the majority data set
ggplot(majority, aes(x = price)) +
geom_histogram(aes(y = after_stat(density)),
bins = 50,
fill = "steelblue",
alpha = 0.7,
color = "white") +
geom_density(color = "darkred", linewidth = 1) +
labs(title = "Distribution of House Prices (Natural Log)",
x = "Price ($)",
y = "Density")
Use ggpairs() in the GGally package to make
matrix plots. Investigate the pairwise relationships between the
variables; give different colors to the data points depending on whether
the house has a view to a waterfront. Note that due to the large number
of variables, you will need to use several matrix plots since one matrix
plot will not fit all the variables. Make sure all the relevant
information that you want to display on the matrix plots is legible.
Include sqft_living, sqft_lot,
sqft_above and sqft_basement in the same plot
to see if any of these variables are correlated. Among these four
variables, which pair has the largest correlation coefficient? In linear
modeling, is it good to include predictors that are highly correlated
with one another (i.e., with correlations close to -1 or 1) in a model?
Explain why or why not.
Since we have limited time for this project I have investigated the
relationship between the month of sale and price and did
not see any patterns; therefore, we will not include the variable
date in our model or the matrix plots.
Also, note that including zipcode in a scatterplot
matrix will not be helpful since the graph will be too small to show the
relationship between price and zipcode; thus,
we will investigate the relationship between price and
zipcode separately later in the report.
Hints: If you are confused about this part, consider the following:
The matrix plots are intended to help prepare for the model building in the later part of the project. Think about which pairs of variables you would like to examine in order to explore their pairwise relationships for the potential linear models.
Based on the variable definitions, the variables that could potentially be strongly correlated should be placed in the same matrix plot. Why? There is more than one valid way to group the variables for the graphs, so be sure to explain the rationale for your choices in your report. Also note that some variables may appear in more than one matrix plot.
Answer
#Create the first GGpairs Graph: with `sqft_living`, `sqft_lot`, `sqft_above`
#and `sqft_basement`
library(GGally)
library(ggplot2)
# Select the variables of interest
selected_vars <- majority[, c("sqft_living", "sqft_lot", "sqft_above",
"sqft_basement", "waterfront")]
# Create the matrix plot with colors based on waterfront
pairs_plot <- ggpairs(
selected_vars,
columns = 1:4,
mapping = aes(color = waterfront, alpha = 0.7),
title = "Pairwise Relationships of Square Footage Variables",
upper = list(continuous = wrap("cor", size = 4)),
lower = list(continuous = wrap("points", size = 0.5)),
diag = list(continuous = wrap("densityDiag", alpha = 0.5))
)
print(pairs_plot)
#Among these four variables, sqft_living and sqft_above have the highest
#correlation, with a value close to 0.9
#No, it is generally not good to include highly correlated predictors
#(multicollinearity) in a linear model because it inflates the standard errors
#of the coefficient estimates. So sqft_living and sqft_above should not both
#be included in the model
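A small simulation (made-up data, not the housing dataset) illustrates this point: adding a nearly collinear predictor inflates the standard error of the coefficient estimate.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is almost collinear with x1
y  <- 2 * x1 + rnorm(n)
# Compare the standard error of the x1 coefficient with and without x2:
# the second model's standard error is much larger
summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
```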
For my matrix plots, I group these variables into two groups: the first based on structure and size, the second on construction details and age.
library(dplyr)
library(GGally)
library(ggplot2)
# Matrix 1: Structural and Size Variables
structural_vars <- c("price", "sqft_living", "sqft_lot", "bedrooms", "bathrooms",
"floors", "grade", "waterfront")
# Plot 1: Correlation matrix for continuous variables
matrix_structural <- ggpairs(
majority[, structural_vars],
mapping = aes(color = waterfront, alpha = 0.6),
title = "Pairwise Relationships: Structural and Size Variables",
upper = list(
continuous = wrap("cor", size = 2),
discrete = wrap("count", size = 1.5)
),
lower = list(
continuous = wrap("points", size = 0.05),
combo = wrap("box_no_facet",
linewidth = 0.2,
outlier.shape = NA),
discrete = wrap("count", size = 1.5)
),
diag = list(
continuous = wrap("densityDiag", alpha = 0.25),
discrete = wrap("barDiag")
)
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(size = 8)
)
# Display the plot
print(matrix_structural)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2
3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the GGally
package.
Please report the issue at
<https://github.com/ggobi/ggally/issues>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this
warning was generated.
#Price is strongly correlated with sqft_living and with bathrooms, moderately
#with bedrooms, and only weakly with sqft_lot
#There is some correlation between price and floors, though it is not strong
#There is a strong association between price and grade of the house
#There is a strong association between price and waterfront
#Among the predictors, sqft_living is strongly correlated with bedrooms,
#bathrooms, and grade
#bedrooms is strongly correlated with bathrooms
library(dplyr)
library(GGally)
library(ggplot2)
# Matrix 2: Construction Details & Age
construction_vars <- c("price", "sqft_above", "sqft_basement" ,
"yr_built", "yr_renovated","condition","renov_Y.N" )
# Plot 1: Correlation matrix for continuous variables
matrix_construction <- ggpairs(
majority[, c(construction_vars, "waterfront")],
mapping = aes(color = waterfront, alpha = 0.6),
title = "Pairwise Relationships: Construction Details & Age",
upper = list(
continuous = wrap("cor", size = 2),
discrete = wrap("count", size = 1.5)
),
lower = list(
continuous = wrap("points", size = 0.05),
combo = wrap("box_no_facet", linewidth = 0.2, outlier.shape = NA),
discrete = wrap("count", size = 1.5)
),
diag = list(
continuous = wrap("densityDiag", alpha = 0.25),
discrete = wrap("barDiag")
)
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(size = 8)
)
# Display the plot
print(matrix_construction)
#There is a strong correlation between price and sqft_above, and some
#correlation between price and sqft_basement
#There is no strong correlation between price and yr_built or yr_renovated
#There is no strong correlation between price and condition, nor between
#price and whether the house has been renovated (renov_Y.N)
Make side by side boxplots to compare the values in
price by zip code. Based on your boxplots, is it good to
include the dummy variables for some of the zip codes in your model?
Explain.
Answer
library(ggplot2)
ggplot(majority, aes(x = zipcode, y = price)) +
geom_boxplot(fill = "lightblue", alpha = 0.7, outlier.size = 0.5) +
labs(
title = "House Price Distribution by Zip Code",
subtitle = "Log-transformed prices",
x = "Zip Code",
y = "log(Price)"
)+
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 8)
)
#Yes, it is good to include dummy variables for some of the zip codes:
#the boxplots show clear differences in median log price across zip codes,
#which makes sense because location has a huge impact on house prices
#However, including every zip code dummy may not be necessary: some zip codes
#have only a few observations, and zip codes with similar price distributions
#add little information while risking overfitting the model
Consider the following model (you will need to remove
eval=F to run the code below):
summary(lm(price~factor(zipcode), data = majority))
Call:
lm(formula = price ~ factor(zipcode), data = majority)
Residuals:
Min 1Q Median 3Q Max
-1.73975 -0.22711 -0.02209 0.19621 2.05618
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.49312 0.01893 660.132 < 2e-16 ***
factor(zipcode)98002 -0.15542 0.03175 -4.896 9.87e-07 ***
factor(zipcode)98003 0.04673 0.02863 1.632 0.102689
factor(zipcode)98004 1.50075 0.02775 54.085 < 2e-16 ***
factor(zipcode)98005 1.06698 0.03358 31.772 < 2e-16 ***
factor(zipcode)98006 1.06588 0.02487 42.865 < 2e-16 ***
factor(zipcode)98007 0.79309 0.03571 22.210 < 2e-16 ***
factor(zipcode)98008 0.78519 0.02855 27.503 < 2e-16 ***
factor(zipcode)98010 0.35845 0.04063 8.822 < 2e-16 ***
factor(zipcode)98011 0.58114 0.03196 18.185 < 2e-16 ***
factor(zipcode)98014 0.40488 0.03743 10.818 < 2e-16 ***
factor(zipcode)98019 0.42157 0.03223 13.081 < 2e-16 ***
factor(zipcode)98022 0.10559 0.03018 3.499 0.000468 ***
factor(zipcode)98023 0.01930 0.02485 0.777 0.437281
factor(zipcode)98024 0.63356 0.04443 14.258 < 2e-16 ***
factor(zipcode)98027 0.75938 0.02592 29.294 < 2e-16 ***
factor(zipcode)98028 0.50656 0.02855 17.744 < 2e-16 ***
factor(zipcode)98029 0.78602 0.02759 28.494 < 2e-16 ***
factor(zipcode)98030 0.07563 0.02938 2.574 0.010053 *
factor(zipcode)98031 0.09133 0.02884 3.167 0.001545 **
factor(zipcode)98032 -0.09494 0.03732 -2.544 0.010964 *
factor(zipcode)98033 0.98826 0.02565 38.522 < 2e-16 ***
factor(zipcode)98034 0.58310 0.02440 23.897 < 2e-16 ***
factor(zipcode)98038 0.26728 0.02403 11.124 < 2e-16 ***
factor(zipcode)98039 1.91159 0.05576 34.283 < 2e-16 ***
factor(zipcode)98040 1.39860 0.02863 48.843 < 2e-16 ***
factor(zipcode)98042 0.10530 0.02437 4.320 1.57e-05 ***
factor(zipcode)98045 0.42067 0.03071 13.697 < 2e-16 ***
factor(zipcode)98052 0.84223 0.02415 34.869 < 2e-16 ***
factor(zipcode)98053 0.86400 0.02604 33.177 < 2e-16 ***
factor(zipcode)98055 0.08601 0.02899 2.967 0.003013 **
factor(zipcode)98056 0.36482 0.02601 14.025 < 2e-16 ***
factor(zipcode)98058 0.22973 0.02534 9.064 < 2e-16 ***
factor(zipcode)98059 0.53138 0.02519 21.097 < 2e-16 ***
factor(zipcode)98065 0.63305 0.02787 22.716 < 2e-16 ***
factor(zipcode)98070 0.52954 0.03813 13.888 < 2e-16 ***
factor(zipcode)98072 0.69931 0.02884 24.248 < 2e-16 ***
factor(zipcode)98074 0.88870 0.02552 34.821 < 2e-16 ***
factor(zipcode)98075 1.04285 0.02680 38.910 < 2e-16 ***
factor(zipcode)98077 0.87024 0.03180 27.367 < 2e-16 ***
factor(zipcode)98092 0.17242 0.02695 6.397 1.62e-10 ***
factor(zipcode)98102 1.04484 0.04017 26.012 < 2e-16 ***
factor(zipcode)98103 0.72852 0.02394 30.426 < 2e-16 ***
factor(zipcode)98105 1.06560 0.03038 35.079 < 2e-16 ***
factor(zipcode)98106 0.13466 0.02728 4.937 8.01e-07 ***
factor(zipcode)98107 0.73387 0.02906 25.257 < 2e-16 ***
factor(zipcode)98108 0.24212 0.03245 7.460 8.96e-14 ***
factor(zipcode)98109 1.09223 0.03930 27.793 < 2e-16 ***
factor(zipcode)98112 1.28615 0.02896 44.408 < 2e-16 ***
factor(zipcode)98115 0.78480 0.02408 32.589 < 2e-16 ***
factor(zipcode)98116 0.77478 0.02739 28.292 < 2e-16 ***
factor(zipcode)98117 0.72404 0.02433 29.759 < 2e-16 ***
factor(zipcode)98118 0.34882 0.02475 14.092 < 2e-16 ***
factor(zipcode)98119 1.06679 0.03257 32.753 < 2e-16 ***
factor(zipcode)98122 0.79002 0.02836 27.862 < 2e-16 ***
factor(zipcode)98125 0.50287 0.02595 19.377 < 2e-16 ***
factor(zipcode)98126 0.40257 0.02690 14.968 < 2e-16 ***
factor(zipcode)98133 0.34112 0.02491 13.695 < 2e-16 ***
factor(zipcode)98136 0.65367 0.02915 22.424 < 2e-16 ***
factor(zipcode)98144 0.66692 0.02711 24.598 < 2e-16 ***
factor(zipcode)98146 0.16194 0.02841 5.700 1.21e-08 ***
factor(zipcode)98148 0.02086 0.05125 0.407 0.683969
factor(zipcode)98155 0.37953 0.02547 14.901 < 2e-16 ***
factor(zipcode)98166 0.41984 0.02945 14.257 < 2e-16 ***
factor(zipcode)98168 -0.14866 0.02896 -5.133 2.88e-07 ***
factor(zipcode)98177 0.80536 0.02941 27.380 < 2e-16 ***
factor(zipcode)98178 0.06626 0.02918 2.270 0.023195 *
factor(zipcode)98188 0.02549 0.03618 0.705 0.481122
factor(zipcode)98198 0.03893 0.02863 1.359 0.174019
factor(zipcode)98199 0.99399 0.02768 35.913 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3596 on 21523 degrees of freedom
Multiple R-squared: 0.5298, Adjusted R-squared: 0.5283
F-statistic: 351.4 on 69 and 21523 DF, p-value: < 2.2e-16
What is the estimate for the y-intercept of the model? Interpret the meaning of the y-intercept. What is the coefficient estimate for the dummy variable for zip code 98004? Interpret this number too. It is okay to give your explanation in terms of the log(price) instead of the original price variable.
Answer The y-intercept is 12.49312; it represents the mean log(price) for the reference zip code, 98001. The coefficient estimate for the dummy variable for zip code 98004 is 1.50075: on average, houses in 98004 have a log(price) that is 1.50075 higher than houses in 98001, so the predicted mean log(price) for 98004 is 12.49312 + 1.50075 = 13.99387.
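This interpretation can be checked on simulated data (the objects g, y, and fit below are illustrative only): with a single factor predictor, the OLS intercept equals the mean response of the reference level, and each dummy coefficient equals the difference between that group's mean and the reference mean.

```r
set.seed(2)
g   <- factor(rep(c("A", "B"), each = 50))   # "A" is the reference level
y   <- c(rnorm(50, mean = 10), rnorm(50, mean = 12))
fit <- lm(y ~ g)
# Both differences are ~0 (up to floating-point error):
coef(fit)["(Intercept)"] - mean(y[g == "A"])
coef(fit)["gB"] - (mean(y[g == "B"]) - mean(y[g == "A"]))
```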
For this part you just need to remove the eval=F
argument for each of the code chunks below and run the code; no action
is required from you other than that. Please make sure that you
understand how each part is created.
majority covers the period from May 2014 to May 2015. We
will prepare a vector date.format that we can use to
extract out the observations that correspond to transactions closed in
May 2015.
date.format = format(as.Date(majority$date, '%Y%m%dT'), "%Y%m")
length(date.format)
[1] 21593
head(date.format)
[1] "201410" "201412" "201502" "201412" "201502" "201405"
dim(majority)
[1] 21593 16
Then, the following lines will extract out all the observations with
transactions closed in May 2015. We will use these observations for our
out-of-time test set test2.
test2 = majority |>
filter(date.format %in% '201505') |>
select(-date)
dim(test2)
[1] 645 15
For the rest of the observations run the following code chunk; this
will split the remaining observations into 2 sets: 85% of the data for
the non-test set, and 15% of the data for the in-time test set (i.e.,
test1).
# remove the `date` column from the data frame
# keep the rows that do not correspond to the period May 2015
library(tidymodels)
── Attaching packages ───────────────────── tidymodels 1.4.1 ──
✔ broom 1.0.9 ✔ rsample 1.3.1
✔ dials 1.4.2 ✔ tailor 0.1.0
✔ infer 1.0.9 ✔ tidyr 1.3.1
✔ modeldata 1.5.1 ✔ tune 2.0.1
✔ parsnip 1.3.3 ✔ workflows 1.3.0
✔ purrr 1.1.0 ✔ workflowsets 1.1.1
✔ recipes 1.3.1 ✔ yardstick 1.3.2
── Conflicts ──────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ recipes::step() masks stats::step()
tmp = majority |>
filter(!date.format %in% '201505') |>
select(-date)
dim(tmp)
[1] 20948 15
set.seed(1746)
data.split = initial_split(tmp,
prop=.85, strata='waterfront')
non_test = training(data.split)
test1 = testing(data.split)
dim(non_test); dim(test1)
[1] 17805 15
[1] 3143 15
# names(non_test); names(test1)
We are now ready to build our model! majority has been
divided into 3 sets:
non_test: the 17805 by 15 non-test set;
test1: the 3143 by 15 in-time test set;
test2: the 645 by 15 out-of-time test set.
Here we will build an Ordinary Least Squares (OLS) regression model with all the variables available to us. (Note that “an OLS regression model” is simply another name for the regular linear regression model.)
Build an OLS model with all the variables available to us. Note that
you might see that R returns NA for the coefficient
estimate for sqft_basement. This means that
sqft_basement is highly correlated with one or more
variables (other than price) in the dataset. Therefore, you
should explicitly exclude sqft_basement in your model. Name
the output of your fitted OLS model ols. Please print all
the coefficient estimates in your report.
Answer
#construct the ols model
ols <- lm(price ~ . - sqft_basement, data = non_test)
summary(ols)
Call:
lm(formula = price ~ . - sqft_basement, data = non_test)
Residuals:
Min 1Q Median 3Q Max
-1.29230 -0.09949 0.00548 0.10552 1.18541
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.209e+01 1.771e-01 68.258 < 2e-16 ***
bedrooms -4.086e-03 2.106e-03 -1.941 0.052333 .
bathrooms 4.042e-02 3.462e-03 11.677 < 2e-16 ***
sqft_living 1.775e-04 4.545e-06 39.064 < 2e-16 ***
sqft_lot 7.070e-07 3.837e-08 18.423 < 2e-16 ***
floors1.5 7.832e-03 5.758e-03 1.360 0.173753
floors2 -2.274e-02 4.846e-03 -4.692 2.73e-06 ***
floors2.5 -4.290e-02 1.762e-02 -2.434 0.014928 *
floors3 -1.301e-01 1.081e-02 -12.038 < 2e-16 ***
floors3.5 -2.555e-01 8.544e-02 -2.990 0.002792 **
waterfront1 6.872e-01 1.698e-02 40.469 < 2e-16 ***
condition2 8.702e-02 4.087e-02 2.129 0.033260 *
condition3 2.288e-01 3.761e-02 6.084 1.20e-09 ***
condition4 2.779e-01 3.762e-02 7.387 1.57e-13 ***
condition5 3.368e-01 3.785e-02 8.897 < 2e-16 ***
grade 1.132e-01 2.299e-03 49.246 < 2e-16 ***
sqft_above 5.836e-05 4.778e-06 12.215 < 2e-16 ***
yr_built -5.910e-04 9.005e-05 -6.563 5.42e-11 ***
yr_renovated 2.894e-03 4.532e-04 6.385 1.75e-10 ***
zipcode98002 -4.958e-02 1.848e-02 -2.682 0.007318 **
zipcode98003 1.773e-02 1.687e-02 1.051 0.293503
zipcode98004 1.092e+00 1.660e-02 65.779 < 2e-16 ***
zipcode98005 7.146e-01 1.968e-02 36.313 < 2e-16 ***
zipcode98006 6.409e-01 1.490e-02 43.021 < 2e-16 ***
zipcode98007 6.385e-01 2.101e-02 30.396 < 2e-16 ***
zipcode98008 6.435e-01 1.679e-02 38.337 < 2e-16 ***
zipcode98010 2.316e-01 2.345e-02 9.878 < 2e-16 ***
zipcode98011 4.433e-01 1.850e-02 23.964 < 2e-16 ***
zipcode98014 2.924e-01 2.152e-02 13.588 < 2e-16 ***
zipcode98019 3.174e-01 1.926e-02 16.479 < 2e-16 ***
zipcode98022 5.504e-02 1.765e-02 3.118 0.001821 **
zipcode98023 -4.004e-02 1.474e-02 -2.717 0.006595 **
zipcode98024 4.040e-01 2.561e-02 15.777 < 2e-16 ***
zipcode98027 4.917e-01 1.527e-02 32.203 < 2e-16 ***
zipcode98028 4.096e-01 1.689e-02 24.251 < 2e-16 ***
zipcode98029 5.790e-01 1.643e-02 35.241 < 2e-16 ***
zipcode98030 4.515e-02 1.702e-02 2.652 0.008010 **
zipcode98031 6.057e-02 1.693e-02 3.579 0.000346 ***
zipcode98032 -6.651e-02 2.165e-02 -3.072 0.002132 **
zipcode98033 7.734e-01 1.511e-02 51.178 < 2e-16 ***
zipcode98034 5.319e-01 1.436e-02 37.047 < 2e-16 ***
zipcode98038 1.664e-01 1.424e-02 11.689 < 2e-16 ***
zipcode98039 1.216e+00 3.276e-02 37.121 < 2e-16 ***
zipcode98040 8.690e-01 1.711e-02 50.787 < 2e-16 ***
zipcode98042 4.830e-02 1.439e-02 3.356 0.000792 ***
zipcode98045 3.271e-01 1.784e-02 18.333 < 2e-16 ***
zipcode98052 6.267e-01 1.426e-02 43.961 < 2e-16 ***
zipcode98053 5.779e-01 1.537e-02 37.592 < 2e-16 ***
zipcode98055 1.236e-01 1.696e-02 7.287 3.30e-13 ***
zipcode98056 3.072e-01 1.538e-02 19.968 < 2e-16 ***
zipcode98058 1.460e-01 1.496e-02 9.759 < 2e-16 ***
zipcode98059 3.301e-01 1.493e-02 22.106 < 2e-16 ***
zipcode98065 4.133e-01 1.649e-02 25.071 < 2e-16 ***
zipcode98070 3.223e-01 2.323e-02 13.876 < 2e-16 ***
zipcode98072 4.853e-01 1.713e-02 28.327 < 2e-16 ***
zipcode98074 5.462e-01 1.511e-02 36.152 < 2e-16 ***
zipcode98075 5.564e-01 1.592e-02 34.951 < 2e-16 ***
zipcode98077 4.297e-01 1.866e-02 23.026 < 2e-16 ***
zipcode98092 2.280e-02 1.600e-02 1.425 0.154233
zipcode98102 9.174e-01 2.376e-02 38.612 < 2e-16 ***
zipcode98103 7.813e-01 1.478e-02 52.855 < 2e-16 ***
zipcode98105 9.107e-01 1.834e-02 49.642 < 2e-16 ***
zipcode98106 2.756e-01 1.619e-02 17.021 < 2e-16 ***
zipcode98107 8.052e-01 1.765e-02 45.617 < 2e-16 ***
zipcode98108 3.157e-01 1.943e-02 16.250 < 2e-16 ***
zipcode98109 9.390e-01 2.346e-02 40.034 < 2e-16 ***
zipcode98112 9.881e-01 1.758e-02 56.212 < 2e-16 ***
zipcode98115 7.869e-01 1.448e-02 54.360 < 2e-16 ***
zipcode98116 7.399e-01 1.637e-02 45.192 < 2e-16 ***
zipcode98117 7.666e-01 1.471e-02 52.106 < 2e-16 ***
zipcode98118 4.276e-01 1.495e-02 28.607 < 2e-16 ***
zipcode98119 9.500e-01 1.970e-02 48.222 < 2e-16 ***
zipcode98122 7.623e-01 1.712e-02 44.539 < 2e-16 ***
zipcode98125 5.486e-01 1.553e-02 35.326 < 2e-16 ***
zipcode98126 5.120e-01 1.616e-02 31.676 < 2e-16 ***
zipcode98133 4.247e-01 1.477e-02 28.751 < 2e-16 ***
zipcode98136 6.595e-01 1.729e-02 38.136 < 2e-16 ***
zipcode98144 6.348e-01 1.630e-02 38.951 < 2e-16 ***
zipcode98146 2.656e-01 1.672e-02 15.883 < 2e-16 ***
zipcode98148 1.589e-01 3.029e-02 5.245 1.58e-07 ***
zipcode98155 3.952e-01 1.504e-02 26.274 < 2e-16 ***
zipcode98166 3.115e-01 1.728e-02 18.024 < 2e-16 ***
zipcode98168 4.878e-02 1.703e-02 2.865 0.004179 **
zipcode98177 6.238e-01 1.755e-02 35.536 < 2e-16 ***
zipcode98178 1.389e-01 1.710e-02 8.124 4.80e-16 ***
zipcode98188 7.854e-02 2.123e-02 3.699 0.000217 ***
zipcode98198 6.996e-02 1.689e-02 4.142 3.46e-05 ***
zipcode98199 8.417e-01 1.664e-02 50.591 < 2e-16 ***
renov_Y.NY -5.709e+00 9.044e-01 -6.312 2.81e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1901 on 17716 degrees of freedom
Multiple R-squared: 0.8692, Adjusted R-squared: 0.8685
F-statistic: 1337 on 88 and 17716 DF, p-value: < 2.2e-16
What is the \(\sqrt{MSE}\) when you predict with your OLS model on the non-test set? What about on the in-time and the out-of-time test sets? Calculate these quantities by using the definition of \(\sqrt{MSE}\). You are not allowed to use any functions that have not been covered in this course to do this part.
Answer
#calculate the RMSE on the non-test set
pred_non_test <- predict(ols, newdata = non_test)
mse_non_test <- mean((non_test$price - pred_non_test)^2)
sqrt_mse_non_test <- sqrt(mse_non_test)
#calculate the RMSE on the in-time test set
pred_test1 <- predict(ols, newdata = test1)
mse_test1 <- mean((test1$price - pred_test1)^2)
sqrt_mse_test1 <- sqrt(mse_test1)
#calculate the RMSE on the out-of-time test set
pred_test2 <- predict(ols, newdata = test2)
mse_test2 <- mean((test2$price - pred_test2)^2)
sqrt_mse_test2 <- sqrt(mse_test2)
The code below (you will need to install the {broom} and {ggfortify} packages before running the code) creates the scatterplot of residuals vs. fitted values, and the quantile-quantile plot of the residuals. What do you expect to see if the distribution of the residuals is approximately Normal?
Also, explain what a fitted value is. For example, for a particular house with log-house-price 12.5, what does the fitted value of the log-house-price of this house mean?
broom::tidy(ols)
library(ggfortify)
autoplot(ols, which = c(1:2), size = .2)
Answer When the distribution of the residuals is
approximately Normal, I would expect the residuals to scatter randomly
around zero: the points should be roughly evenly spread about the
horizontal line at zero, with no clear patterns, curves, or funnel
shapes. In the QQ plot, the points should follow the 45-degree diagonal
line. A fitted value is obtained by plugging an observation's predictor
values into the estimated OLS coefficients; it is the model's predicted
value for that observation. For example, for a house with
log-house-price 12.5, the fitted value is the model's predicted
log-house-price based on that house's characteristics, which need not
equal the observed 12.5.
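As a small sketch of this idea (assuming `ols` was fit with `lm()` and has no NA coefficients), the fitted value for the first observation can be recovered by hand from the design matrix and the coefficient vector:

```r
# Hedged sketch: a fitted value is the sum of each estimated coefficient
# times the corresponding predictor value for that observation.
x1  <- model.matrix(ols)[1, ]                    # predictors for the first house
fv1 <- sum(coef(ols) * x1)                       # beta-hat transposed times x_1
all.equal(unname(fv1), unname(fitted(ols)[1]))   # should be TRUE
```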
Here we will build a linear model with the LASSO.
Construct a LASSO regression model using all available variables,
excluding sqft_basement. Make sure that you stratify the
cross-validation folds by waterfront. Use 10-fold cross-validation and ask R to choose 200 possible \(\lambda\) values to consider. Repeat the 10-fold cross-validation process 3 times. When splitting the dataset for the 10-fold cross-validation, please make sure that all folds have about the same proportion of houses that have a view to a waterfront. Use \(\sqrt{MSE}\) as the criterion for choosing the best \(\lambda\). The selected \(\lambda\) should be the one that corresponds to the simplest model whose \(\sqrt{MSE}\) is within one SE of the lowest \(\sqrt{MSE}\).
Show the plot of \(\sqrt{MSE}\) vs. \(\lambda\) values. Report the \(\lambda\) value that you choose and the coefficient estimates of your champion model (it is fine to just print the coefficient estimates as outputs in a code chunk). How many predictors does your champion model have? (Predictors are the X-variables in the mathematical form of your model, so the y-intercept is not a predictor.) Compare this number with that for the OLS model.
Please make sure that you print the coefficient estimates only and not other uninformative quantities, such as the NA values. (Depending on how you code this part, you might not see the NAs at all. Thus, not seeing any NAs is not an indication that something has gone wrong.) Also, please use seed 1746 for all parts that involve a random process.
Hint:
In lecture notes Ch 7.2, the output of coef() is a
matrix. Suppose m is a matrix. Recall that
m[1, ] extracts the first row of m and
m[ ,1] extracts the first column. You should check that
each extracted column is a vector. It might be helpful to inspect the
values in the column vector after you extract the column of
interest.
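A toy illustration of this hint (with a made-up matrix, not the actual LASSO output):

```r
# Row vs. column extraction from a matrix; each result is a named vector.
m <- matrix(1:6, nrow = 2, dimnames = list(c("a", "b"), c("s1", "s2", "s3")))
m[1, ]  # first row:    c(s1 = 1, s2 = 3, s3 = 5)
m[, 1]  # first column: c(a = 1, b = 2)
```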
Answer
library(tidymodels)
library(glmnet)
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Loaded glmnet 4.1-10
library(ggplot2)
library(dplyr)
set.seed(1746)
rec <- recipe(price ~ ., data = non_test) |>
#remove sqft_basement from predictors
update_role(sqft_basement, new_role = "exclude") |>
#in case any were missed: convert all categorical variables to factors
step_string2factor(all_nominal_predictors()) |>
#handle unseen levels and missing values in categorical variables
step_novel(all_nominal_predictors()) |>
step_unknown(all_nominal_predictors()) |>
#expand categorical variables into one-hot dummies
step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
#remove zero-variance predictors (skipped when baking new data)
step_zv(all_predictors(), skip = TRUE) |>
#standardize numeric predictors
step_normalize(all_numeric_predictors())
#Building LASSO model
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
set_engine("glmnet")
#Combine Preprocessing and model into one workflow
wf <- workflow() |>
add_recipe(rec) |>
add_model(lasso_spec)
#10-fold CV repeated 3 times, stratified by waterfront
set.seed(1746)
folds <- vfold_cv(non_test, v = 10, repeats = 3, strata = waterfront)
#Create a grid of 200 candidate lambda values to try
lambda_grid <- grid_regular(penalty(), levels = 200)
#Tune lambda using repeated CV and RMSE
set.seed(1746)
tuned <- tune_grid(
wf,
resamples = folds,
grid = lambda_grid,
metrics = metric_set(rmse),
control = control_grid(verbose = FALSE)
)
#pick the simplest model (largest lambda) whose RMSE is within 1 SE of the best RMSE
best_1se <- select_by_one_std_err(tuned, metric = "rmse", desc(penalty))
# Extract the selected lambda value
best_lambda <- best_1se$penalty
best_lambda
[1] 0.001933892
# ggplot RMSE vs lambda
rmse_df <- tuned |>
collect_metrics() |>
filter(.metric == "rmse") |>
arrange(penalty)
ggplot(rmse_df, aes(x = penalty, y = mean)) +
geom_point() +
geom_line() +
geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err), width = 0) +
scale_x_log10() +
labs(x = expression(lambda), y = "RMSE (sqrt(MSE))")
# Fit champion model on full non_test set
champ_fit_1 <- fit(finalize_workflow(wf, best_1se), data = non_test)
# Coefficient estimates
beta_mat <- extract_fit_engine(champ_fit_1) |> coef(s = best_lambda)
beta_vec <- beta_mat[, 1]
names(beta_vec) <- rownames(beta_mat)
beta_nz <- beta_vec[beta_vec != 0]
beta_nz
(Intercept) bathrooms sqft_living sqft_lot
1.304463e+01 2.742830e-02 1.689790e-01 2.436458e-02
grade sqft_above yr_built yr_renovated
1.422791e-01 2.831093e-02 -2.711875e-02 1.058981e-02
floors_X1.5 floors_X2 floors_X3 floors_X3.5
3.392163e-03 -4.243201e-05 -1.246848e-02 -1.550334e-03
waterfront_X0 waterfront_X1 condition_X1 condition_X2
-5.665668e-02 1.539549e-12 -8.216951e-03 -1.475700e-02
condition_X3 condition_X5 zipcode_X98001 zipcode_X98002
-2.038882e-02 1.488602e-02 -5.407378e-02 -4.564817e-02
zipcode_X98003 zipcode_X98004 zipcode_X98005 zipcode_X98006
-4.639385e-02 7.424417e-02 2.179173e-02 2.657530e-02
zipcode_X98007 zipcode_X98008 zipcode_X98010 zipcode_X98014
1.316292e-02 2.067661e-02 -1.232678e-02 -9.111050e-03
zipcode_X98019 zipcode_X98022 zipcode_X98023 zipcode_X98027
-9.397352e-03 -3.802053e-02 -7.012997e-02 4.113833e-03
zipcode_X98028 zipcode_X98029 zipcode_X98030 zipcode_X98031
-2.412765e-03 1.324846e-02 -4.234110e-02 -4.098745e-02
zipcode_X98032 zipcode_X98033 zipcode_X98034 zipcode_X98038
-3.756390e-02 4.394353e-02 1.162973e-02 -4.278237e-02
zipcode_X98039 zipcode_X98040 zipcode_X98042 zipcode_X98045
3.406926e-02 4.536323e-02 -5.951537e-02 -9.654922e-03
zipcode_X98052 zipcode_X98053 zipcode_X98055 zipcode_X98056
2.687361e-02 1.744131e-02 -3.423314e-02 -1.652151e-02
zipcode_X98058 zipcode_X98059 zipcode_X98065 zipcode_X98070
-4.061023e-02 -1.426717e-02 -1.642287e-03 -5.404386e-03
zipcode_X98072 zipcode_X98074 zipcode_X98075 zipcode_X98092
2.445578e-03 1.189524e-02 1.241333e-02 -5.009818e-02
zipcode_X98102 zipcode_X98103 zipcode_X98105 zipcode_X98106
2.890257e-02 4.943724e-02 4.365162e-02 -2.009379e-02
zipcode_X98107 zipcode_X98108 zipcode_X98109 zipcode_X98112
3.506652e-02 -1.072588e-02 3.142234e-02 5.595960e-02
zipcode_X98115 zipcode_X98116 zipcode_X98117 zipcode_X98118
5.161348e-02 3.243647e-02 4.648671e-02 -2.310932e-03
zipcode_X98119 zipcode_X98122 zipcode_X98125 zipcode_X98126
4.184453e-02 3.221579e-02 1.082247e-02 5.303471e-03
zipcode_X98133 zipcode_X98136 zipcode_X98144 zipcode_X98146
-2.440224e-03 2.013469e-02 1.983883e-02 -1.938935e-02
zipcode_X98148 zipcode_X98155 zipcode_X98166 zipcode_X98168
-1.286100e-02 -5.663881e-03 -1.271725e-02 -4.307620e-02
zipcode_X98177 zipcode_X98178 zipcode_X98188 zipcode_X98198
1.630363e-02 -3.263009e-02 -2.729514e-02 -4.027804e-02
zipcode_X98199
4.355356e-02
# Number of predictors in champion model
n_pred_lasso <- sum(beta_vec[-1] != 0)
n_pred_lasso
[1] 84
#My model has 84 predictors
# Number of predictors for OLS model
prep_rec <- prep(rec, training = non_test)
X <- bake(prep_rec, new_data = non_test) |> select(-price)
n_pred_ols <- ncol(X)
n_pred_ols
[1] 104
#The OLS model has 104 predictors
#The LASSO model has 20 fewer predictors than the OLS model
We will consider additional variables to try to improve our model.
Create a data frame, plot.dat, and use it to make the
scatterplot of the average of log(price) vs. the
number of bedrooms for the houses in majority. From the
scatterplot we can see that for houses with b or more
bedrooms, on average the more bedrooms a house has, the lower the sold
price. Provide the value of b.
Answer
library(dplyr)
library(ggplot2)
# Make the plot: average log(price) vs bedrooms in majority
plot.dat <- majority |>
group_by(bedrooms) |>
summarise(avg_log_price = mean(log(price), na.rm = TRUE), .groups = "drop") |>
arrange(bedrooms)
# scatterplot
print( ggplot(plot.dat, aes(x = bedrooms, y = avg_log_price)) +
geom_point() +
labs(x = "Bedrooms", y = "Average log(price)") )
#b is 8: the average log(price) rises with the number of bedrooms up to 8
#and falls after, so b is the bedroom count at which the trend reverses
We will improve our model by creating an additional variable for our model to consider.
For the non-test set, add an additional factor column that indicates
if a house has 8 or more bedrooms and name this column
eight.plus. Also do this for each of the test sets.
Answer
library(dplyr)
non_test <- non_test |>
mutate(
eight.plus = factor(bedrooms >= 8, levels = c(FALSE, TRUE))
)
test1 <- test1 |>
mutate(
eight.plus = factor(bedrooms >= 8, levels = c(FALSE, TRUE))
)
test2 <- test2 |>
mutate(
eight.plus = factor(bedrooms >= 8, levels = c(FALSE, TRUE))
)
Improve your recipe in Part a by including the interaction term for
bedrooms and eight.plus, and the interaction
term for sqft_living and waterfront.
(Hint: The function to add interaction terms is
step_interact(). Please check out its help page. In
particular, see the example on the help page that deals with dummy
variables and interaction terms. Note that the order you put in the
recipe steps matters. You should see how each step in the recipe affects
your dataset as we did in Week 10 PLE videos (see the video labeled
“Video_10.02_Recipe”) by running the bake() and
prep() functions.)
Refit your LASSO model with these additional terms. Again, report the coefficient estimates of your champion model (it is fine to just print the coefficient estimates as outputs in a code chunk).
Answer
library(tidymodels)
library(glmnet)
set.seed(1746)
# Make a recipe with two interactions: 1) bedrooms:eight.plus 2) sqft_living:waterfront
rec_int <- recipe(price ~ ., data = non_test) |>
update_role(sqft_basement, new_role = "exclude") |>
# ensure categoricals are factors
step_string2factor(all_nominal_predictors()) |>
# handle unseen levels and missing values in categorical variables
step_novel(all_nominal_predictors()) |>
step_unknown(all_nominal_predictors()) |>
# expand categoricals
step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
step_interact(terms = ~ bedrooms:all_of("eight.plus_TRUE.")) |>
step_interact(terms = ~ all_of("sqft_living"):starts_with("waterfront")) |>
# remove zero-variance predictors (skipped when baking new data)
step_zv(all_predictors(), skip = TRUE) |>
# standardize numeric predictors
step_normalize(all_numeric_predictors())
# LASSO workflow
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
set_engine("glmnet")
wf_int <- workflow() |>
add_recipe(rec_int) |>
add_model(lasso_spec)
# Repeated stratified CV (3 repeats) on the non-test set
set.seed(1746)
folds_int <- vfold_cv(non_test, v = 10, repeats = 3, strata = waterfront)
lambda_grid <- grid_regular(penalty(), levels = 200)
set.seed(1746)
tuned_int <- tune_grid(
wf_int,
resamples = folds_int,
grid = lambda_grid,
metrics = metric_set(rmse),
control = control_grid(verbose = FALSE)
)
# 1-SE rule (simplest within 1 SE of best RMSE)
best_1se_int <- select_by_one_std_err(
tuned_int,
metric = "rmse",
desc(penalty)
)
best_lambda_int <- best_1se_int$penalty
best_lambda_int
[1] 0.001933892
# Fit champion model on full non-test set
champ_wf_int <- finalize_workflow(wf_int, best_1se_int)
set.seed(1746)
champ_fit_int <- fit(champ_wf_int, data = non_test)
# Print coefficient estimates
beta_mat_int <- extract_fit_engine(champ_fit_int) |> coef(s = best_lambda_int)
beta_vec_int <- beta_mat_int[, 1]
names(beta_vec_int) <- rownames(beta_mat_int)
beta_nz_int <- beta_vec_int[beta_vec_int != 0]
beta_nz_int
(Intercept) bathrooms sqft_living
1.304463e+01 2.781840e-02 1.692849e-01
sqft_lot grade sqft_above
2.433777e-02 1.418882e-01 2.831339e-02
yr_built yr_renovated floors_X1.5
-2.736713e-02 1.053222e-02 3.292042e-03
floors_X2 floors_X3 floors_X3.5
-4.411209e-05 -1.245409e-02 -1.196059e-03
waterfront_X0 waterfront_X1 condition_X1
-5.663056e-02 1.549764e-12 -8.225782e-03
condition_X2 condition_X3 condition_X5
-1.475762e-02 -2.032407e-02 1.484141e-02
zipcode_X98001 zipcode_X98002 zipcode_X98003
-5.405622e-02 -4.563950e-02 -4.637663e-02
zipcode_X98004 zipcode_X98005 zipcode_X98006
7.439725e-02 2.179374e-02 2.662295e-02
zipcode_X98007 zipcode_X98008 zipcode_X98010
1.317231e-02 2.068411e-02 -1.231328e-02
zipcode_X98014 zipcode_X98019 zipcode_X98022
-9.099997e-03 -9.390084e-03 -3.799621e-02
zipcode_X98023 zipcode_X98027 zipcode_X98028
-7.010580e-02 4.121563e-03 -2.405545e-03
zipcode_X98029 zipcode_X98030 zipcode_X98031
1.326827e-02 -4.232922e-02 -4.097664e-02
zipcode_X98032 zipcode_X98033 zipcode_X98034
-3.754629e-02 4.395429e-02 1.164417e-02
zipcode_X98038 zipcode_X98039 zipcode_X98040
-4.277227e-02 3.405798e-02 4.536062e-02
zipcode_X98042 zipcode_X98045 zipcode_X98052
-5.948852e-02 -9.635354e-03 2.689335e-02
zipcode_X98053 zipcode_X98055 zipcode_X98056
1.745581e-02 -3.417170e-02 -1.650574e-02
zipcode_X98058 zipcode_X98059 zipcode_X98065
-4.059001e-02 -1.421588e-02 -1.652657e-03
zipcode_X98070 zipcode_X98072 zipcode_X98074
-5.379382e-03 2.458150e-03 1.191607e-02
zipcode_X98075 zipcode_X98092 zipcode_X98102
1.242021e-02 -5.007505e-02 2.899638e-02
zipcode_X98103 zipcode_X98105 zipcode_X98106
4.944545e-02 4.393791e-02 -2.002700e-02
zipcode_X98107 zipcode_X98108 zipcode_X98109
3.512592e-02 -1.072806e-02 3.142469e-02
zipcode_X98112 zipcode_X98115 zipcode_X98116
5.610629e-02 5.162770e-02 3.245046e-02
zipcode_X98117 zipcode_X98118 zipcode_X98119
4.650223e-02 -2.305599e-03 4.184549e-02
zipcode_X98122 zipcode_X98125 zipcode_X98126
3.228005e-02 1.083499e-02 5.326704e-03
zipcode_X98133 zipcode_X98136 zipcode_X98144
-2.369011e-03 2.014982e-02 1.989829e-02
zipcode_X98146 zipcode_X98148 zipcode_X98155
-1.937725e-02 -1.285833e-02 -5.647182e-03
zipcode_X98166 zipcode_X98168 zipcode_X98177
-1.271305e-02 -4.306409e-02 1.631111e-02
zipcode_X98178 zipcode_X98188 zipcode_X98198
-3.262984e-02 -2.729362e-02 -4.026095e-02
zipcode_X98199 eight.plus_FALSE. eight.plus_TRUE.
4.355900e-02 3.732245e-03 -1.510602e-14
pred_non <- non_test |>
mutate(.pred = predict(champ_fit_int, non_test)$.pred)
pred_in <- test1 |>
mutate(.pred = predict(champ_fit_int, test1)$.pred)
pred_out <- test2 |>
mutate(.pred = predict(champ_fit_int, test2)$.pred)
rmse_non <- rmse(pred_non, truth = price, estimate = .pred)$.estimate
rmse_in <- rmse(pred_in, truth = price, estimate = .pred)$.estimate
rmse_out <- rmse(pred_out, truth = price, estimate = .pred)$.estimate
What is the \(\sqrt{MSE}\) when you predict with your Champion model on the non-test set? What about on the in-time and the out-of-time test sets?
Side note 1: If the LASSO picks most of the variables, then it is not surprising that the error rate is not improved from the error rate for the OLS model.
Side note 2: Additional improvement can be made by combining zipcodes that have similar average house prices into the same category; however, due to time constraints, we will not do that here.
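As a hedged sketch of Side note 2 (not part of the graded solution), zipcodes with similar average log-prices could be collapsed into a handful of bands and the band used in place of zipcode in the recipe. Column names assume the project's non_test data frame, with price already on the log scale:

```r
library(dplyr)

# Average log-price per zipcode, then bin zipcodes into bands of similar price
zip_bands <- non_test |>
  group_by(zipcode) |>
  summarise(avg_price = mean(price), .groups = "drop") |>
  mutate(zip_band = cut(avg_price, breaks = 5))  # e.g., 5 equal-width bands

# Attach the band to each house; zip_band would replace zipcode in the recipe
non_test_banded <- non_test |>
  left_join(zip_bands |> select(zipcode, zip_band), by = "zipcode")
```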
Answer
The RMSE on the non-test set is 0.1908648, on the in-time test set 0.1939395, and on the out-of-time test set 0.2207851.
The final champion LASSO model retains most of the main structural predictors such as bathrooms, square footage measures, grade, and year built, as well as a large number of zipcode indicators. This suggests that regularization does not substantially reduce model complexity relative to OLS.
The indicator variable eight.plus has a coefficient essentially equal to zero, indicating that after controlling for other predictors, distinguishing houses with eight or more bedrooms does not provide additional predictive power.
Overall, this is consistent with the observation that the LASSO model does not significantly improve prediction accuracy relative to the OLS model, as noted in Side Note 1.
This is a subset of a dataset from kaggle.com. We will use only a subset of the variables in the dataset because we only want to include variables that have clear definitions and seem relevant for house prices.↩︎